feat(bigtable): add client side metric instrumentation to basic rpcs #16712
daniel-sanche wants to merge 8 commits into main
Conversation
Code Review
This pull request integrates client-side metrics tracking into the Bigtable data client for both asynchronous and synchronous implementations. It wraps key operations—including sample_row_keys, mutate_row, check_and_mutate_row, and read_modify_write_row—with metrics collection logic and introduces a tracked_retry wrapper to monitor retry attempts. Additionally, the PR refactors system tests by consolidating fixtures into a shared SystemTestRunner class and adds new system tests specifically for metrics. Feedback focuses on regressions in retry logic where critical arguments like sleep_generator and exception_factory were omitted in the transition to tracked_retry. There are also suggestions to improve resource cleanup in test fixtures and to relax restrictive timing assertions in tests to prevent flakiness.
),
    clusters=cluster_config,
)
operation.result(timeout=240)
If operation.result(timeout=240) raises an exception (e.g., a TimeoutError), the fixture will stop execution and the delete_instance call in the teardown phase will never be reached. This can lead to leaked Bigtable instances in the test project. Consider wrapping the creation and yield in a try...finally block or ensuring cleanup happens even on creation failure.
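A minimal sketch of the suggested try/finally shape for the fixture (the fixture, client, and parameter names below are illustrative stand-ins, not the real fixture's):

```python
def instance_fixture(admin_client, instance_id, cluster_config):
    # Hypothetical fixture body: create the instance, then guarantee cleanup
    # even if waiting on the long-running operation raises (e.g. TimeoutError).
    operation = admin_client.create_instance(instance_id, clusters=cluster_config)
    try:
        operation.result(timeout=240)
        yield instance_id
    finally:
        # Reached on creation failure too, so no test instances leak.
        admin_client.delete_instance(instance_id)
```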
operation.zone
== cluster_config[operation.cluster_id].location.split("/")[-1]
)
assert operation.duration_ns > 0 and operation.duration_ns < 1e9
The assertion operation.duration_ns < 1e9 (1 second) might be too restrictive for system tests running against a live backend. Network latency or backend load could easily cause an RPC to exceed 1 second, leading to flaky tests. It is recommended to remove this upper bound or increase it significantly.
Suggested change:
- assert operation.duration_ns > 0 and operation.duration_ns < 1e9
+ assert operation.duration_ns > 0
mutianf
left a comment
some final nits, otherwise lgtm
tuple[Exception, Exception|None]:
    tuple of the exception to raise, and a cause exception if applicable
"""
exc_list = exc_list.copy()
why do we need to copy the exception list now?
This is cleaner: we wouldn't expect a factory method like this to modify its input arguments. Copying creates an isolated reference.
IIRC I think this only came up in test code though
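The point can be shown with a toy factory (the names and behavior here are illustrative, not the real exception_factory): copying first means the caller's exception history is never mutated as a side effect.

```python
def exception_factory(exc_list):
    # Copy defensively so the factory never mutates the caller's list.
    exc_list = exc_list.copy()
    cause = exc_list.pop() if exc_list else None
    return RuntimeError("operation failed"), cause

history = [ValueError("attempt 1"), ValueError("attempt 2")]
exc, cause = exception_factory(history)
# The caller's list is untouched even though the factory popped from its copy.
assert len(history) == 2
```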
# validate operation
operation = handler.completed_operations[0]
assert isinstance(operation, CompletedOperationMetric)
assert operation.final_status.value[0] == 0
why is this final_status.value[0] and not just final_status.value? or final_status.name == 'OK'?
value is a tuple of (0, "ok"). We could just do final_status.name == 'OK' if you think that's clearer though
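For reference, the status enum pairs a numeric code with a lowercase name in its value, so `.value[0]` picks the code and `.name` the canonical name. A self-contained mock of that shape (a stand-in mimicking `grpc.StatusCode`, not the real enum):

```python
import enum

class StatusCode(enum.Enum):
    # Mirrors the (int code, str name) tuple shape described above.
    OK = (0, "ok")
    ABORTED = (10, "aborted")
    UNAVAILABLE = (14, "unavailable")

final_status = StatusCode.OK
assert final_status.value == (0, "ok")   # whole tuple
assert final_status.value[0] == 0        # numeric code only
assert final_status.name == "OK"         # arguably the clearest check
```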
for i in range(num_retryable):
    attempt = handler.completed_attempts[i]
    assert isinstance(attempt, CompletedAttemptMetric)
    assert attempt.end_status.name == "ABORTED"
Carrying over my comment from before: I don't think we retry ABORTED errors for mutate_row by default, so maybe use UNAVAILABLE instead? Also a little surprised that this test passes?
We allow the user to set custom retryable errors, which is what the test is doing here. So that's why it passes.
I can't remember for sure why I chose explicit retryable errors here instead of the default; maybe just so the test makes sense even if you don't know the defaults. I could switch it to UNAVAILABLE if you prefer.
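As a sketch of what the test relies on, here is a retry predicate built from an explicit set of retryable codes (the helper and class names are illustrative, not the client's API):

```python
def retry_if_in(retryable_names):
    # Returns a predicate that retries only the explicitly listed codes,
    # regardless of what the client would retry by default.
    def predicate(exc):
        return getattr(exc, "code_name", None) in retryable_names
    return predicate

class RpcError(Exception):
    def __init__(self, code_name):
        super().__init__(code_name)
        self.code_name = code_name

# The test passes ABORTED explicitly, even though it is not retryable by default.
is_retryable = retry_if_in({"ABORTED"})
assert is_retryable(RpcError("ABORTED"))
assert not is_retryable(RpcError("UNAVAILABLE"))
```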
)
return [(s.row_key, s.offset_bytes) async for s in results]

return await tracked_retry(
where did sleep_generator and exception_factory go?
tracked_retry contains them both. We needed custom versions of sleep_generator to report backoff, and of exception_factory to report terminal errors to the metrics module. See go/bigtable-csm-python
timeout=operation_timeout,
retry=None,
)
return result.predicate_matched
this is not wrapped in tracked_retry. Will the attempt level metrics (attempt latencies, server latencies and connectivity error count) still be recorded?
check_and_mutate shouldn't have retries, right?
But yes, a single attempt will be recorded when this completes. The duration/gfe_latency data is captured, and will be exported as those metrics in the follow-up PR.
retry=None,
)
# construct Row from result
return Row._from_pb(result.row)
same question as check and mutate.
Migration of googleapis/python-bigtable#1188 to the monorepo
This PR builds off of googleapis/python-bigtable#1187 to add instrumentation to basic data client rpcs (check_and_mutate, read_modify_write, sample_row_keys, mutate_row)
Metrics are not currently being exported anywhere, just collected and dropped. A future PR will add a GCP exporter to the system